The german data file, with 1000 observations, can be downloaded from: http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data.
This data set classifies customers as good or bad credit risks for bank loans.
The data set has 21 variables: 20 attributes describing the credit profile of each person in the sample, plus the good/bad credit classification.
| field | attribute | values | type |
|---|---|---|---|
| 1 | status of existing checking account | A11,A12,A13,A14 | categorical |
| 2 | duration in months | positive integers | numeric |
| 3 | credit history | A30,A31,A32,A33,A34 | categorical |
| 4 | purpose | A40,A41,A42,A43,A44,A45,A46,A47,A48,A49,A410 | categorical |
| 5 | credit amount | positive integers | numeric |
| 6 | savings account/bonds | A61,A62,A63,A64,A65 | categorical |
| 7 | present employment since | A71,A72,A73,A74,A75 | categorical |
| 8 | installment rate as a percentage of disposable income | integers from 1 to 4 | numeric |
| 9 | personal status and sex | A91,A92,A93,A94,A95 | categorical |
| 10 | other debtors/guarantors | A101,A102,A103 | categorical |
| 11 | present residence since | positive integers | numeric |
| 12 | property | A121,A122,A123,A124 | categorical |
| 13 | age in years | positive integers | numeric |
| 14 | other installment plans | A141,A142,A143 | categorical |
| 15 | housing | A151,A152,A153 | categorical |
| 16 | number of existing credits at this bank | positive integers | numeric |
| 17 | job | A171,A172,A173,A174 | categorical |
| 18 | number of dependents (people liable to provide maintenance for) | positive integers | numeric |
| 19 | telephone | A191,A192 | categorical |
| 20 | foreign worker | A201,A202 | categorical |
| 21 | good credit/bad credit | 1,2 | categorical |
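The loading step is not shown in this report, so the following is only a minimal sketch of how the file could be read in base R. The helper name `read_german` is hypothetical, the column names are copied from the `names()` output of this report, and the two sample rows are reconstructed from the first values printed by `str()`. The printed output indicates that the original analysis used a dplyr `tbl_df`; a plain data frame is used here as a stand-in.

```r
# Column names copied from the names() output shown in this report.
col_names <- c(
  "Status of existing checking account", "Duration in month", "Credit history",
  "Purpose", "Credit amount", "Savings account/bonds", "Present employment since",
  "Installment rate in percentage of disposable income", "Personal status and sex",
  "Other debtors / guarantors", "Present residence since", "Property",
  "Age in years", "Other installment plans", "Housing",
  "Number of existing credits at this bank", "Job",
  "Number of people being liable to provide maintenance for",
  "Telephone", "foreign worker", "Good.Loan"
)

# Hypothetical helper: read the whitespace-separated file (it has no header row)
# and attach the descriptive names. Arguments are passed on to read.table().
read_german <- function(...) {
  german <- read.table(..., header = FALSE, stringsAsFactors = FALSE)
  names(german) <- col_names
  german
}

# Two sample rows, reconstructed from the first values printed by str().
sample_rows <- c(
  "A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1",
  "A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2"
)
german_sample <- read_german(text = sample_rows)
dim(german_sample)

# Assumed call for the full file (needs network access):
# german <- read_german("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
```

Assigning the names after reading, rather than through `col.names`, keeps the spaces and slashes exactly as they appear in the output of this report.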
The general structure of the data set is shown below.
## Source: local data frame [1,000 x 21]
##
## Status of existing checking account Duration in month Credit history
## 1 A11 6 A34
## 2 A12 48 A32
## 3 A14 12 A34
## 4 A11 42 A32
## 5 A11 24 A33
## 6 A14 36 A32
## 7 A14 24 A32
## 8 A12 36 A32
## 9 A14 12 A32
## 10 A12 30 A34
## .. ... ... ...
## Variables not shown: Purpose (chr), Credit amount (int), Savings
## account/bonds (chr), Present employment since (chr), Installment rate in
## percentage of disposable income (int), Personal status and sex (chr),
## Other debtors / guarantors (chr), Present residence since (int),
## Property (chr), Age in years (int), Other installment plans (chr),
## Housing (chr), Number of existing credits at this bank (int), Job (chr),
## Number of people being liable to provide maintenance for (int),
## Telephone (chr), foreign worker (chr), Good.Loan (int)
We can see the 21 variables: 13 are stored as character and appear to be categorical, and the remaining 8 are stored as integer and appear to be numeric, although one of those (Good.Loan) is actually categorical.
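The count of 13 character and 8 integer columns can be checked directly. This is a small sketch that uses two rows reconstructed from the `str()` output of this report; the object name `german_sample` is an assumption.

```r
# Two rows reconstructed from the first values printed by str() in this report.
sample_rows <- c(
  "A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1",
  "A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2"
)
german_sample <- read.table(text = sample_rows, header = FALSE,
                            stringsAsFactors = FALSE)

# Storage class of every column, then a tally per class.
classes <- vapply(german_sample, class, character(1))
class_counts <- table(classes)
class_counts
```

A tally like this only reports how the columns are stored; deciding that Good.Loan is categorical still requires reading the attribute table.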
The dimensions of the data set are shown below.
## [1] 1000 21
The data set contains 1000 records, each with 21 variables.
The variable names are shown below.
## [1] "Status of existing checking account"
## [2] "Duration in month"
## [3] "Credit history"
## [4] "Purpose"
## [5] "Credit amount"
## [6] "Savings account/bonds"
## [7] "Present employment since"
## [8] "Installment rate in percentage of disposable income"
## [9] "Personal status and sex"
## [10] "Other debtors / guarantors"
## [11] "Present residence since"
## [12] "Property"
## [13] "Age in years"
## [14] "Other installment plans"
## [15] "Housing"
## [16] "Number of existing credits at this bank"
## [17] "Job"
## [18] "Number of people being liable to provide maintenance for"
## [19] "Telephone"
## [20] "foreign worker"
## [21] "Good.Loan"
The structure of the data set is shown below.
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 21 variables:
## $ Status of existing checking account : chr "A11" "A12" "A14" "A11" ...
## $ Duration in month : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Credit history : chr "A34" "A32" "A34" "A32" ...
## $ Purpose : chr "A43" "A43" "A46" "A42" ...
## $ Credit amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ Savings account/bonds : chr "A65" "A61" "A61" "A61" ...
## $ Present employment since : chr "A75" "A73" "A74" "A74" ...
## $ Installment rate in percentage of disposable income : int 4 2 2 2 3 2 3 2 2 4 ...
## $ Personal status and sex : chr "A93" "A92" "A93" "A93" ...
## $ Other debtors / guarantors : chr "A101" "A101" "A101" "A103" ...
## $ Present residence since : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Property : chr "A121" "A121" "A121" "A122" ...
## $ Age in years : int 67 22 49 45 53 35 53 35 61 28 ...
## $ Other installment plans : chr "A143" "A143" "A143" "A143" ...
## $ Housing : chr "A152" "A152" "A152" "A153" ...
## $ Number of existing credits at this bank : int 2 1 1 1 2 1 1 1 1 2 ...
## $ Job : chr "A173" "A173" "A172" "A173" ...
## $ Number of people being liable to provide maintenance for: int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : chr "A192" "A191" "A191" "A191" ...
## $ foreign worker : chr "A201" "A201" "A201" "A201" ...
## $ Good.Loan : int 1 2 1 1 2 1 1 1 1 2 ...
The data set is coded, so we cannot tell whether there is any discrepancy between the data and the corresponding variable names; a decoding step is therefore carried out later.
The first observations of the data set are shown below.
## Source: local data frame [6 x 21]
##
## Status of existing checking account Duration in month Credit history
## 1 A11 6 A34
## 2 A12 48 A32
## 3 A14 12 A34
## 4 A11 42 A32
## 5 A11 24 A33
## 6 A14 36 A32
## Variables not shown: Purpose (chr), Credit amount (int), Savings
## account/bonds (chr), Present employment since (chr), Installment rate in
## percentage of disposable income (int), Personal status and sex (chr),
## Other debtors / guarantors (chr), Present residence since (int),
## Property (chr), Age in years (int), Other installment plans (chr),
## Housing (chr), Number of existing credits at this bank (int), Job (chr),
## Number of people being liable to provide maintenance for (int),
## Telephone (chr), foreign worker (chr), Good.Loan (int)
The last observations of the data set are shown below.
## Source: local data frame [6 x 21]
##
## Status of existing checking account Duration in month Credit history
## 995 A14 12 A32
## 996 A14 12 A32
## 997 A11 30 A32
## 998 A14 12 A32
## 999 A11 45 A32
## 1000 A12 45 A34
## Variables not shown: Purpose (chr), Credit amount (int), Savings
## account/bonds (chr), Present employment since (chr), Installment rate in
## percentage of disposable income (int), Personal status and sex (chr),
## Other debtors / guarantors (chr), Present residence since (int),
## Property (chr), Age in years (int), Other installment plans (chr),
## Housing (chr), Number of existing credits at this bank (int), Job (chr),
## Number of people being liable to provide maintenance for (int),
## Telephone (chr), foreign worker (chr), Good.Loan (int)
A randomly selected group of observations from the data set is shown below.
## Source: local data frame [6 x 21]
##
## Status of existing checking account Duration in month Credit history
## 348 A12 24 A32
## 374 A14 60 A34
## 860 A14 9 A32
## 97 A14 12 A34
## 959 A11 28 A32
## 922 A14 48 A33
## Variables not shown: Purpose (chr), Credit amount (int), Savings
## account/bonds (chr), Present employment since (chr), Installment rate in
## percentage of disposable income (int), Personal status and sex (chr),
## Other debtors / guarantors (chr), Present residence since (int),
## Property (chr), Age in years (int), Other installment plans (chr),
## Housing (chr), Number of existing credits at this bank (int), Job (chr),
## Number of people being liable to provide maintenance for (int),
## Telephone (chr), foreign worker (chr), Good.Loan (int)
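The first, last, and randomly selected rows above can be obtained along these lines. This is a base R sketch on a toy data frame; the names `toy` and `random_rows` and the seed are arbitrary assumptions, and the original report may have used dplyr for the same purpose.

```r
# Toy stand-in with 1000 rows, the same number of observations as the data set.
toy <- data.frame(obs = 1:1000,
                  duration = rep(c(6L, 48L, 12L, 42L), times = 250))

head(toy)  # first 6 rows
tail(toy)  # last 6 rows

# Random sample of 6 rows; the seed is arbitrary and only makes the draw repeatable.
set.seed(2014)
random_rows <- toy[sample(nrow(toy), 6), ]
random_rows
```

Looking at a random sample in addition to the head and tail helps catch problems that only appear in the middle of a file.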
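The statistical summary of the data set is shown below. Nothing looks unusual: for example, durations range from 4 to 72 months, credit amounts from 250 to 18424, and ages from 19 to 75 years.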
## Status of existing checking account Duration in month Credit history
## Length:1000 Min. : 4.0 Length:1000
## Class :character 1st Qu.:12.0 Class :character
## Mode :character Median :18.0 Mode :character
## Mean :20.9
## 3rd Qu.:24.0
## Max. :72.0
## Purpose Credit amount Savings account/bonds
## Length:1000 Min. : 250 Length:1000
## Class :character 1st Qu.: 1366 Class :character
## Mode :character Median : 2320 Mode :character
## Mean : 3271
## 3rd Qu.: 3972
## Max. :18424
## Present employment since
## Length:1000
## Class :character
## Mode :character
##
##
##
## Installment rate in percentage of disposable income
## Min. :1.00
## 1st Qu.:2.00
## Median :3.00
## Mean :2.97
## 3rd Qu.:4.00
## Max. :4.00
## Personal status and sex Other debtors / guarantors
## Length:1000 Length:1000
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Present residence since Property Age in years
## Min. :1.00 Length:1000 Min. :19.0
## 1st Qu.:2.00 Class :character 1st Qu.:27.0
## Median :3.00 Mode :character Median :33.0
## Mean :2.85 Mean :35.5
## 3rd Qu.:4.00 3rd Qu.:42.0
## Max. :4.00 Max. :75.0
## Other installment plans Housing
## Length:1000 Length:1000
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Number of existing credits at this bank Job
## Min. :1.00 Length:1000
## 1st Qu.:1.00 Class :character
## Median :1.00 Mode :character
## Mean :1.41
## 3rd Qu.:2.00
## Max. :4.00
## Number of people being liable to provide maintenance for
## Min. :1.00
## 1st Qu.:1.00
## Median :1.00
## Mean :1.16
## 3rd Qu.:1.00
## Max. :2.00
## Telephone foreign worker Good.Loan
## Length:1000 Length:1000 Min. :1.0
## Class :character Class :character 1st Qu.:1.0
## Mode :character Mode :character Median :1.0
## Mean :1.3
## 3rd Qu.:2.0
## Max. :2.0
Since the data set has no id column, we add one.
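A minimal sketch of this step, assuming the id is simply the row number (consistent with the id range of 1 to 1000 in the final summary); the toy values come from the `str()` output of this report.

```r
# Toy stand-in with the first three values printed by str().
german_sample <- data.frame(duration.in.month = c(6L, 48L, 12L),
                            credit.amount = c(1169L, 5951L, 2096L))

# Consecutive integer id, one per row.
german_sample$id <- seq_len(nrow(german_sample))
german_sample
```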
We check the format of the variable names and find that not all of them have the desired format (several contain spaces, slashes, or capital letters), so we will clean them up.
## [1] "Status of existing checking account"
## [2] "Duration in month"
## [3] "Credit history"
## [4] "Purpose"
## [5] "Credit amount"
## [6] "Savings account/bonds"
## [7] "Present employment since"
## [8] "Installment rate in percentage of disposable income"
## [9] "Personal status and sex"
## [10] "Other debtors / guarantors"
## [11] "Present residence since"
## [12] "Property"
## [13] "Age in years"
## [14] "Other installment plans"
## [15] "Housing"
## [16] "Number of existing credits at this bank"
## [17] "Job"
## [18] "Number of people being liable to provide maintenance for"
## [19] "Telephone"
## [20] "foreign worker"
## [21] "Good.Loan"
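One way to clean names like these is to combine `make.names()` with `tolower()`. The cleaning code used in the original is not shown, so this is only a sketch; it does, however, reproduce forms such as other.debtors...guarantors, where the space, slash, and space each become a dot.

```r
# A few of the raw names listed above.
raw_names <- c("Status of existing checking account",
               "Savings account/bonds",
               "Other debtors / guarantors",
               "foreign worker",
               "Good.Loan")

# make.names() replaces every invalid character with a dot;
# tolower() then puts everything in lower case.
clean_names <- tolower(make.names(raw_names, unique = TRUE))
clean_names
```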
We carry out the corresponding recoding to better understand each variable.
The classes of the variables are shown below.
## Status of existing checking account
## "character"
## Duration in month
## "integer"
## Credit history
## "character"
## Purpose
## "character"
## Credit amount
## "integer"
## Savings account/bonds
## "character"
## Present employment since
## "character"
## Installment rate in percentage of disposable income
## "integer"
## Personal status and sex
## "character"
## Other debtors / guarantors
## "character"
## Present residence since
## "integer"
## Property
## "character"
## Age in years
## "integer"
## Other installment plans
## "character"
## Housing
## "character"
## Number of existing credits at this bank
## "integer"
## Job
## "character"
## Number of people being liable to provide maintenance for
## "integer"
## Telephone
## "character"
## foreign worker
## "character"
## Good.Loan
## "integer"
Some variables are of type character and others are numeric. Apparently the only adjustment needed is to convert the character variables to factors so that they can be used later in the analysis.
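The conversion of character columns to factors could look like the sketch below. It is written in base R on a toy data frame with values taken from the `str()` output; the name `german_sample` is an assumption.

```r
german_sample <- data.frame(purpose = c("A43", "A43", "A46", "A42"),
                            credit.amount = c(1169L, 5951L, 2096L, 7882L),
                            stringsAsFactors = FALSE)

# Find the character columns and convert only those to factors.
is_chr <- vapply(german_sample, is.character, logical(1))
german_sample[is_chr] <- lapply(german_sample[is_chr], factor)

vapply(german_sample, class, character(1))
```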
We remove spaces, punctuation, and similar characters from the factor levels. In this case, in addition to the normalization, we change the variable Good.Loan, which looks numeric but really represents 2 categories.
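Both steps can be sketched as follows. The helper name `normalize_levels` is hypothetical. The mapping of 1 to goodloan and 2 to badloan is inferred from the summaries in this report: the mean of Good.Loan is 1.3, which implies 300 values equal to 2, matching the 300 badloan records.

```r
# Hypothetical helper: lower-case the levels and drop spaces and punctuation.
normalize_levels <- function(f) {
  levels(f) <- tolower(gsub("[[:space:][:punct:]]+", "", levels(f)))
  f
}

status <- normalize_levels(factor(c("A11", "A12", "A14", "A11")))
levels(status)

# Recode the numeric class label as a two-level factor (mapping inferred, see above).
good_loan_raw <- c(1L, 2L, 1L, 1L, 2L)
good_loan <- factor(good_loan_raw, levels = c(2L, 1L),
                    labels = c("badloan", "goodloan"))
table(good_loan)
```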
Apparently no transformation of the data is needed. In the graphical analysis we will try to identify whether any transformation is required.
After applying our preparation process (metadata cleaning and format adjustment), we have the following:
We check how the variable names of the data set look.
## [1] "status.of.existing.checking.account"
## [2] "duration.in.month"
## [3] "credit.history"
## [4] "purpose"
## [5] "credit.amount"
## [6] "savings.account.bonds"
## [7] "present.employment.since"
## [8] "installment.rate.in.percentage.of.disposable.income"
## [9] "personal.status.and.sex"
## [10] "other.debtors...guarantors"
## [11] "present.residence.since"
## [12] "property"
## [13] "age.in.years"
## [14] "other.installment.plans"
## [15] "housing"
## [16] "number.of.existing.credits.at.this.bank"
## [17] "job"
## [18] "number.of.people.being.liable.to.provide.maintenance.for"
## [19] "telephone"
## [20] "foreign.worker"
## [21] "good.loan"
## [22] "id"
This is the result of the adjustment to the variable types.
## status.of.existing.checking.account
## "factor"
## duration.in.month
## "integer"
## credit.history
## "factor"
## purpose
## "factor"
## credit.amount
## "integer"
## savings.account.bonds
## "factor"
## present.employment.since
## "factor"
## installment.rate.in.percentage.of.disposable.income
## "integer"
## personal.status.and.sex
## "factor"
## other.debtors...guarantors
## "factor"
## present.residence.since
## "integer"
## property
## "factor"
## age.in.years
## "integer"
## other.installment.plans
## "factor"
## housing
## "factor"
## number.of.existing.credits.at.this.bank
## "integer"
## job
## "factor"
## number.of.people.being.liable.to.provide.maintenance.for
## "integer"
## telephone
## "factor"
## foreign.worker
## "factor"
## good.loan
## "factor"
## id
## "integer"
And we look at the statistical summary again.
## status.of.existing.checking.account duration.in.month credit.history
## a11:274 Min. : 4.0 a30: 40
## a12:269 1st Qu.:12.0 a31: 49
## a13: 63 Median :18.0 a32:530
## a14:394 Mean :20.9 a33: 88
## 3rd Qu.:24.0 a34:293
## Max. :72.0
##
## purpose credit.amount savings.account.bonds
## a43 :280 Min. : 250 a61:603
## a40 :234 1st Qu.: 1366 a62:103
## a42 :181 Median : 2320 a63: 63
## a41 :103 Mean : 3271 a64: 48
## a49 : 97 3rd Qu.: 3972 a65:183
## a46 : 50 Max. :18424
## (Other): 55
## present.employment.since
## a71: 62
## a72:172
## a73:339
## a74:174
## a75:253
##
##
## installment.rate.in.percentage.of.disposable.income
## Min. :1.00
## 1st Qu.:2.00
## Median :3.00
## Mean :2.97
## 3rd Qu.:4.00
## Max. :4.00
##
## personal.status.and.sex other.debtors...guarantors
## a91: 50 a101:907
## a92:310 a102: 41
## a93:548 a103: 52
## a94: 92
##
##
##
## present.residence.since property age.in.years other.installment.plans
## Min. :1.00 a121:282 Min. :19.0 a141:139
## 1st Qu.:2.00 a122:232 1st Qu.:27.0 a142: 47
## Median :3.00 a123:332 Median :33.0 a143:814
## Mean :2.85 a124:154 Mean :35.5
## 3rd Qu.:4.00 3rd Qu.:42.0
## Max. :4.00 Max. :75.0
##
## housing number.of.existing.credits.at.this.bank job
## a151:179 Min. :1.00 a171: 22
## a152:713 1st Qu.:1.00 a172:200
## a153:108 Median :1.00 a173:630
## Mean :1.41 a174:148
## 3rd Qu.:2.00
## Max. :4.00
##
## number.of.people.being.liable.to.provide.maintenance.for telephone
## Min. :1.00 a191:596
## 1st Qu.:1.00 a192:404
## Median :1.00
## Mean :1.16
## 3rd Qu.:1.00
## Max. :2.00
##
## foreign.worker good.loan id
## a201:963 badloan :300 Min. : 1
## a202: 37 goodloan:700 1st Qu.: 251
## Median : 500
## Mean : 500
## 3rd Qu.: 750
## Max. :1000
##
To observe the behavior of the variables in the data set, we plot each of them.
We also plot the relationship between each pair of variables.
Because the category names of some variables in our cleaned data set are very long, we will use the original coded data set so that the plots are easier to read.
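The plotting code is not included in this report, which lists ggplot2 among its attached packages. As a rough base R stand-in, the two hypothetical helpers below draw one plot per variable and a scatterplot matrix. Note that `pairs()` as used here only covers pairs of numeric variables, not every pair.

```r
# Hypothetical helper: one plot per column (histogram for numeric columns,
# bar chart of counts for everything else).
plot_each <- function(data) {
  for (v in names(data)) {
    x <- data[[v]]
    if (is.numeric(x)) {
      hist(x, main = v, xlab = v)
    } else {
      barplot(table(x), main = v)
    }
  }
  invisible(NULL)
}

# Hypothetical helper: scatterplot matrix for every pair of numeric columns.
plot_pairs <- function(data) {
  numeric_cols <- vapply(data, is.numeric, logical(1))
  pairs(data[numeric_cols])
  invisible(NULL)
}
```

Calling `plot_each(german)` and `plot_pairs(german)` on the coded data set, where `german` is an assumed object name, keeps the short codes such as A11 on the axes.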
## R version 3.1.1 (2014-07-10)
## Platform: i386-w64-mingw32/i386 (32-bit)
##
## locale:
## [1] LC_COLLATE=Spanish_Mexico.1252 LC_CTYPE=Spanish_Mexico.1252
## [3] LC_MONETARY=Spanish_Mexico.1252 LC_NUMERIC=C
## [5] LC_TIME=Spanish_Mexico.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] corrplot_0.73 stringr_0.6.2 lubridate_1.3.3
## [4] directlabels_2013.6.15 quadprog_1.5-5 ggplot2_1.0.0
## [7] dplyr_0.2 tidyr_0.1 plyr_1.8.1
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5
## [5] formatR_1.0 gtable_0.1.2 htmltools_0.2.4 knitr_1.6
## [9] labeling_0.3 MASS_7.3-33 memoise_0.2.1 munsell_0.4.2
## [13] parallel_3.1.1 proto_0.3-10 Rcpp_0.11.2 reshape2_1.4
## [17] rmarkdown_0.2.64 scales_0.2.4 tools_3.1.1 yaml_2.1.13